Skip to content

perf(github): Cache accessible repos for accessibleOnly search#112548

Merged
jaydgoss merged 18 commits intomasterfrom
jaygoss/vdy-68-perfgithub-optimize-accessibleonly-repo-search-to-avoid
Apr 10, 2026
Merged

perf(github): Cache accessible repos for accessibleOnly search#112548
jaydgoss merged 18 commits intomasterfrom
jaygoss/vdy-68-perfgithub-optimize-accessibleonly-repo-search-to-avoid

Conversation

@jaydgoss
Copy link
Copy Markdown
Member

@jaydgoss jaydgoss commented Apr 8, 2026

Summary

  • The OrganizationIntegrationReposEndpoint (/integrations/{id}/repos/) lets the frontend search for GitHub repos available to a GitHub App installation. When called with accessibleOnly=true and a search query (as the SCM onboarding repo selector does on each debounced keystroke), the previous implementation fetched all installation-accessible repos from the GitHub API (up to 50 pages of 100 = 5,000 repos) on every request, then filtered with a Python list comprehension
  • Cache the full repo list in sentry.cache.default_cache (Redis) for 5 minutes, and filter locally on subsequent requests — reducing each typed query from O(pages) GitHub API calls to zero

Test plan

  • Existing get_repositories tests pass (6/6)
  • New test_get_repositories_accessible_only_caches_repos verifies cache hit path skips /installation/repositories calls
  • Manual testing: second accessibleOnly search returns instantly from cache

Refs VDY-68

jaydgoss added 2 commits April 8, 2026 16:59
When accessibleOnly=true with a search query, the old path fetched all
installation repos (up to 5,000) on every debounced keystroke, then
filtered with a Python list comprehension. Replace this with a cached
set of accessible repo IDs (5-min Redis TTL) combined with the GitHub
Search API, reducing each typed query from O(pages) API calls to a
single search call plus a Redis lookup.

Refs VDY-68
…h API

Switch from Search API + cached ID set to caching the full repo list
and filtering locally. This avoids the Search API's shared 30 req/min
rate limit and uses sentry.cache.default_cache (Redis-backed) instead
of django.core.cache (DummyCache in Sentry).

Refs VDY-68
@linear-code
Copy link
Copy Markdown

linear-code bot commented Apr 8, 2026

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 8, 2026
jaydgoss added 2 commits April 8, 2026 17:57
Keep the cached repo list unfiltered so the cache is a faithful
snapshot of the GitHub API response. Apply the archived filter in
get_repositories alongside the other transforms. Also let the
accessible_only path handle both with and without a query.

Refs VDY-68
The Search API does not return archived repos, so the archived filter
should only apply to the /installation/repositories paths.
Move no-query path first since accessible_only is only useful with a
query (repeated keystrokes). Combine archived and query filters into
a single pass through to_repo_info.
Strip raw GitHub repo dicts down to the 5 fields used by
get_repositories before storing in the cache. Reduces per-integration
cache size from ~3KB per repo to ~100 bytes.
getsentry configures CACHES with memcached in production, so
django.core.cache.cache works and matches the pattern used by the
rest of the integrations codebase.
@jaydgoss jaydgoss marked this pull request as ready for review April 9, 2026 15:42
@jaydgoss jaydgoss requested a review from a team as a code owner April 9, 2026 15:42
Comment on lines +561 to +587
def get_accessible_repos_cached(self, ttl: int = 300) -> list[CachedRepo]:
"""
Return all repos accessible to this installation.
Cached in Django cache for ``ttl`` seconds so that debounced
search keystrokes don't re-fetch all pages from GitHub.

Only the fields used by get_repositories() are stored to keep
the cache payload small.
"""
cache_key = f"github:accessible_repos:{self.integration.id}"
cached = cache.get(cache_key)
if cached is not None:
return cached

all_repos = self.get_repos()
repos: list[CachedRepo] = [
{
"id": r["id"],
"name": r["name"],
"full_name": r["full_name"],
"default_branch": r.get("default_branch"),
"archived": r.get("archived"),
}
for r in all_repos
]
cache.set(cache_key, repos, ttl)
return repos
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really get_accessible_repos_cached, it's get_repos_cached

r
for r in all_repos_cached
if not r.get("archived") and query_lower in r["full_name"].lower()
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that rather than making the cache use conditional on accessible_only, we should have use_cache=False. We should have an assert like assert not use_cache or not query since the cache doesn't work with the query

Copy link
Copy Markdown
Member Author

@jaydgoss jaydgoss Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little confused by this comment because the cache is set up to work with the accessible_only variant of querying.

accessible_only is a means to query accessible repos, default query behavior was exposing repos that sentry does not necessarily have access to (public repos not selected during installation configuration).

the cache in its current form its only enabled for the accessible_only path because I didn't want to alter behavior for existing consumers.

that said cache could instead be opt in and applied to either the not query path or the accessible_only path, is that along the lines of your thinking?

Further changes here are introduced with pagination support #112591

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point is that it doesn't really make sense to make the cache specific to accessible only. We're caching all the repos, and then we're filtering to accessible only in python. So if we add use_cache=False, then in your usages you pass use_cache=True, is_accessible=True it's more general and allows us to use the cache in other places later, if we want to. In general, I feel like Claude tends to do things in an overly specific way so I just wanted to guide us away from that here.

Copy link
Copy Markdown
Member Author

@jaydgoss jaydgoss Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will alter things so that cache can be applied to either the not query or accessible_only paths.

Filtering of accessible is not handled via python filtering (its a different github api path), the "archived" filter is kind of a red herring there.

accessible vs non accessible:

accessible repo fetch client.get_repos(
accessible agnostic repo search client.search_repositories(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I misread here... although I'm confused about the split that exists in the current code because it looks like it fetches all repos from github and then filters out archived repos when it's accessible.

It seems like if there's no query, we should always be using self.get_client().get_repos? I'm not totally sure why we'd use the search api when we're fetching everything

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, discussed on slack and I understand what's going on a bit better here now.

Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.

Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.

Comment on lines +357 to +359
# accessible_only: fetch and filter accessible repos (cached)
# avoids re-fetching all pages on every debounced keystroke and
# avoids the Search API's 30 req/min shared rate limit.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to one part of the frontend, I'd probably remove this from here, or move it to the api

if not r.get("archived") and query_lower in r["full_name"].lower()
)

# Query without accessible_only: existing search behavior
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this comment

repos = [r for r in repos if query_lower in str(r["identifier"]).lower()]
return repos

# No query: fetch all accessible repos (without cache)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this comment too

r
for r in all_repos_cached
if not r.get("archived") and query_lower in r["full_name"].lower()
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, discussed on slack and I understand what's going on a bit better here now.

Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.

Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.

Add explicit use_cache parameter to get_repositories instead of
implicitly tying caching to the accessible_only flag. This makes
caching an independent concern that callers opt into explicitly.
Rename method and cache key to reflect that the cache is not specific
to the accessible_only path. Remove implementation-detail comments
from get_repositories.
query: str | None = None,
page_number_limit: int | None = None,
accessible_only: bool = False,
use_cache: bool = False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use_cache parameter is accepted but ignored in GitHub Enterprise integration

The use_cache parameter was added to the method signature for interface compliance but is never used in the implementation. When accessibleOnly=true is passed to the repos endpoint, the endpoint calls get_repositories(..., use_cache=True) (line 76 of organization_integration_repos.py). The GitHub integration correctly calls client.get_repos_cached() when use_cache=True, but the GitHub Enterprise integration silently ignores this flag and always calls get_client().get_repos() without caching. GitHub Enterprise users will not receive the caching benefit this PR is intended to provide.

Verification

Read github_enterprise/integration.py lines 224-254 to confirm use_cache parameter is never referenced in method body. Read github/integration.py lines 355-376 to see correct implementation using use_cache. Read organization_integration_repos.py line 76 to confirm use_cache=accessible_only is passed to all integrations.

Identified by Warden sentry-backend-bugs · CHR-VP4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix attempt detected (commit 3bf81a4)

The use_cache parameter was added to the GitHub Enterprise get_repositories signature for interface compliance, but the method implementation still ignores it and unconditionally calls get_client().get_repos() without any conditional caching logic, identical to the before state in the critical execution paths.

The original issue appears unresolved. Please review and try again.

Evaluated by Warden

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is siloed to github only, not GHE

Avoid caching the initial no-query load when accessibleOnly is set.
Cache is only useful for debounced keystroke searches, not the first
page load which should always return fresh data.
@jaydgoss jaydgoss merged commit b691b5d into master Apr 10, 2026
61 checks passed
@jaydgoss jaydgoss deleted the jaygoss/vdy-68-perfgithub-optimize-accessibleonly-repo-search-to-avoid branch April 10, 2026 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants